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Abstract 

In a range of fields including the geosciences, 
molecular biology, robotics and computer vision, 
one encounters problems that involve random 
variables on manifolds. Currently, there is a lack 
of fiexible probabilistic models on manifolds that 
are fast and easy to train. We define an extremely 
fiexible class of exponential family distributions 
on manifolds such as the torus, sphere, and ro¬ 
tation groups, and show that for these distribu¬ 
tions the gradient of the log-likelihood can be 
computed efficiently using a non-commutative 
generalization of the Fast Fourier Transform 
(FFT). We discuss applications to Bayesian cam¬ 
era motion estimation (where harmonic exponen¬ 
tial families serve as conjugate priors), and mod¬ 
elling of the spatial distribution of earthquakes 
on the surface of the earth. Our experimental re¬ 
sults show that harmonic densities yield a signif¬ 
icantly higher likelihood than the best competing 
method, while being orders of magnitude faster 
to train. 


1. Introduction 

Many problems in science and engineering involve ran¬ 
dom variables on manifolds. In the geosciences, for ex¬ 
ample, one deals with measurements such as the locations 
of earthquakes and weather events on the spherical surface 
of the earth. In robotics and computer vision, unobserved 
Lie group transformations such as rotations and rigid-body 
motions play an important role in motion understanding, 
localization and alignment problems. Classical probabilis¬ 
tic models cannot be applied to data on manifolds because 
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these models do not respect the manifold topology (hav¬ 
ing discontinuities at the value 0 = 27r of a circular vari¬ 
able, for instance), and because they are not equivariant (a 
manifold-preserving transformation of the data would take 
the distribution outside the model family). Among mani¬ 
fold distributions there are currently none that are both fiex¬ 
ible and efficiently trainable. 

In this paper we study a very fiexible class of densities 
on compact Lie groups (such as the group of rotations in 
two or three dimensions) and homogeneous spaces (such 
as the circle, torus, and sphere). We refer to these den¬ 
sities as harmonic exponential families, because they are 
based on a generalized form of Fourier analysis known 
as non-commutative harmonic analysis (Chirikjian & Ky- 
atkin, 2001). Specifically, the sufficient statistics of these 
families are given by functions that are analogous to the 
sinusoids on the circle. 

We show that the moment map (Wainwright & Jordan, 
2007) that takes the natural parameters of an exponential 
family and produces the moments of the distribution can be 
computed very efficiently for harmonic exponential fami¬ 
lies using generalized Fast Fourier Transform (FFT) algo¬ 
rithms. This leads directly to a very efficient maximum- 
likelihood estimation procedure that is applicable to any 
manifold for which an FFT algorithm has been developed, 
and that enjoys the convexity and convergence properties 
of exponential families. 

We apply the harmonic exponential family of the sphere to 
the problem of modeling the spatial distribution of earth¬ 
quakes on the surface of the earth. Our model significantly 
outperforms the best competing method (a mixture model), 
while being orders of magnitude faster to train. 

Harmonic exponential families also arise naturally as con¬ 
jugate priors in a Bayesian framework for estimating trans¬ 
formations from correspondence pairs. In this case the 
points on the manifold (the transformations) are not ob¬ 
served directly. Instead we see a pair of vectors x and y 
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— images, point clouds, or other data — that provide in¬ 
formation about the transformation that produced one from 
the other. If we have a prior p{g) over transformations and 
a likelihood p{y\x, g) that measures how likely y is to be 
the ^-transformed version of x, we can consider the pos¬ 
terior over transformations p{g\x^ y). Typically, the pos¬ 
terior turns into a complicated and intractable distribution, 
but we show that if the prior is a harmonic density and the 
likelihood is Gaussian, the posterior distribution is again a 
harmonic density whose parameters are easily obtained us¬ 
ing the generalized FFT algorithm. Furthermore, the global 
mode of this posterior (the optimal transformation) can be 
computed efficiently by performing yet another FFT. 

In this paper we provide both an abstract treatment of the 
theory of harmonic densities that covers a fairly broad class 
of manifolds in a uniform and easily understandable man¬ 
ner, and a concrete instantiation of this theory in the case 
of the rotation groups SO(2) and SO(3), and their homo¬ 
geneous spaces, the circle and sphere 5'^. 

The rest of this paper is organized as follows. We be¬ 
gin by discussing related work in section 2, followed by 
a brief summary of non-commutative harmonic analysis 
for a machine learning audience in section 3. In section 
4, we define harmonic exponential families and present an 
FFT-based maximum-likelihood estimation algorithm. In 
section 5, we show that harmonic exponential families are 
the conjugate priors in the Bayesian transformation esti¬ 
mation problem, and present an efficient MAP inference 
algorithm. Our earthquake modelling experiments are pre¬ 
sented in section 6, followed by a discussion and conclu¬ 
sion. 

2. Related Work 

The harmonic exponential families were first defined ab¬ 
stractly (but not named) by Diaconis (1988). Despite their 
long history and a significant body of literature devoted to 
them (see Mardia & Jupp (1999)), this class of densities 
has remained intractable until now for all but the simplest 
cases. 

In fact, various commonly used distributions on the cir¬ 
cle and the sphere are harmonic densities of low degree. 
Among these are the 2-parameter von-Mises distribution 
on the circle and the 5-parameter Kent distribution on the 
sphere. These are both exponential families, about which 
Mardia & Jupp (1999) write (p. 175): “Although expo¬ 
nential models have many pleasant inferential properties, 
the need to evaluate the normalizing constant (or at least 
the first derivative of its logarithm) can be a practical diffi¬ 
culty.” 

What is a practical difficulty for the unimodal distributions 
with few parameters mentioned above becomes a show- 


stopper for more fiexible exponential families with many 
parameters. The harmonic exponential family for the circle 
is known as the generalized von-Mises distribution, and can 
be defined for any band-limit / order / degree L. However, 
no scalable maximum-likelihood estimation algorithm is 
known. Gatto & Jammalamadaka (2007) (who work only 
with the 4-parameter L = 2 distribution) compute the nor¬ 
malizing constant by a truncated infinite sum, where each 
term involves an expensive Bessel function evaluation. 

Beran (1979) studies an exponential family that is equiva¬ 
lent to the harmonic exponential family on the sphere, but 
does not provide a scalable learning algorithm. At each 
iteration, the proposed algorithm computes all O(T^) mo¬ 
ments of the distribution by numerical integration. This 
requires O(T^) samples per integration, making the per- 
iteration cost 0{L^). Beran further suggests using a second 
order optimization method, which would further increase 
the per-iteration cost to 0{L^). 

This is clearly not feasible when L is measured in the hun¬ 
dreds, and parameter counts in the lO’s of thousands, as is 
needed in the experiments reported in section 6. The al¬ 
gorithm described in this paper is simple, generic across 
manifolds, and fast (per-iteration complexity 0{L‘^ log^ L) 
in the spherical case, for instance). It can be applied to any 
manifold for which an FFT has been developed. 

3. Preliminaries 

The manifolds we consider in this paper are either Lie 
groups or closely related manifolds called homogeneous 
spaces, and the sufficient statistics that we use come from 
harmonic analysis on these manifolds. Since these con¬ 
cepts are not widely known in machine learning, we will 
review them in this section. For more details, we refer 
the reader to Chirikjian & Kyatkin (2001); Sugiura (1990); 
Kondor (2008); Goodman & Wallach (2009). 

3.1. Lie Groups 

A transformation group G is a set of invertible transforma¬ 
tions that is closed under composition and taking inverses: 
for any G G, the composition gh\^ again a member of 
G, and so are the inverses g~^ and h~^. 

A Lie group is a group that is also a differentiable mani¬ 
fold. For example, the group SO(3) of 3D rotations is a 
Lie group. It can be represented as a set of matrices, 

SO(3) = {Re I RR^ = /, det(i?) = 1}, (1) 

so one way to think about SO(3) is as a 3-dimensional man¬ 
ifold embedded in For technical reasons, we will 

further restrict our attention to compact Lie groups, i.e. Lie 
groups that are closed and bounded. 
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3.2. Harmonic analysis on compact Lie groups 

The basic idea of the generalized Fourier transform on 
compact Lie groups is to expand a function f : G ^ C 
as a linear combination of carefully chosen basis functions. 
These basis functions have very special properties, because 
they are the matrix elements of irreducible unitary repre¬ 
sentations (lURs) of G. These terms will now be explained. 

A representation of a group G on a vector space y is a 
map R from the group to the set of invertible linear trans¬ 
formations of V that preserves the group structure in the 
following sense: 


R{gh)=R{g)R{h). (2) 

Note that gh denotes composition of group elements while 
R{g)R{h) denotes matrix multiplication (once we choose 
a basis for V, that is). 

In computer vision, we encounter group representations in 
the following way. An image is represented as a vector 
X e of pixel intensities, and to be concrete, we take G 
to be the group SO(2) consisting of rotations of the plane. 
Then R{0) is the matrix such that R{0)x is the image x 
rotated by angle 0. 

As this example shows, many representations do not 
change the norm of the vectors on which they act: G 

G,Vx G V : ||f/(^)x|| = ||x||. Such representations are 
called unitary. Unitary representations tend to be easier to 
work with both analytically and numerically. 

Given a unitary representation U and a unitary matrix 
F, one can define an equivalent representation T{g) = 
F~^U{g)F. In computer vision one can think of F as a 
matrix containing image features in the rows. One can now 
try to find F such that for every g, the matrix F~^U{g)F 
is block diagonal with the same block structure. If we con¬ 
tinue to block-diagonalize until no further diagonalization 
is possible, we end up with blocks called irreducible uni¬ 
tary representations ^. The lURs of a compact group can be 
indexed by a discrete index A, and we denote the lURs as 
uyg). 

As an example, consider the 2D rotation group SO(2). 
Since SO(2) is commutative, its representation matrices 
can be jointly diagonalized and so the lURs of SO(2) are 
1x1 matrices: 

UUg) = (3) 

They satisfy the composition rule 

which is the manifestation of eq. 2 for this representation. 
The standard Fourier series of a function on the circle is an 
expansion in terms of these matrix elements, which shows 

^Technically, we have defined the slightly easier to understand 
notion of indecomposability, which in this context implies irre- 
ducibility. 


that standard Fourier analysis is a special case of the more 
general transform to be defined shortly. 

The matrix elements of lURs of SO(3) are known as 
Wigner D-functions. They are defined for A = 0,1, 2,... 
and —A < m, n < A, so the irreducible representations 
are (2A + 1)-dimensional. Wigner D-functions can be ex¬ 
pressed as sums over products of sinusoids or complex ex¬ 
ponentials (Pinchon & Hoggan, 2007), but the formulae are 
somewhat unwieldy so that it is easier to think only about 
their general properties. 

The most important general property of the matrix elements 
of lURs is that they are orthogonal: 


JG 




(4) 
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Here g is the normalized Haar measure, which is the natural 
way to measure volumes in G (Sugiura, 1990), and dim A 
is the dimension of the representation. One can verify that 
the complex exponentials are indeed orthonormal with 
dg{g) = ^ and dim A = 1. 

Intuitively, the matrix elements are like a “complete or¬ 
thogonal basis” for the space L^(G) of square integrable 
functions on G. That is, it can be proven that any function 
/ G L^(G) can be written as 


fig) = 'F'l2dmnTmnig) = ^v] (s), (5) 

A mn 


where T^^{g) = \/dimXU^^{g) are the F^-normalized 
matrix elements and p are the Fourier coefficients of /. 

Integrating eq. 5 against a matrix element and using or¬ 
thonormality, we find: 

vXn = [ f{9)TXn{9)dg{g)- ( 6 ) 

JG 

This is the (generalized) Fourier transform for compact 
groups, denoted 


Fast and exact algorithms for the computation of Fourier 
coefficients from samples of bandlimited functions on the 
rotation groups SO(2) and SO(3) have been developed, 
and the theory required to construct such algorithms for 
general compact Lie groups is understood (Maslen & 
Rockmore, 1997). The group SO(2) is isomorphic to the 
circle, so for G = SO(2) equation 5 reduces to a stan¬ 
dard Fourier series on the circle, for which the well-known 
0(F log F) FFT algorithm can be used. The SO(3) FFT 
has complexity 0{L^ log^ L) for bandlimit L (Maslen & 
Rockmore, 1997; Kostelec & Rockmore, 2008; Potts et al.. 
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2009). This is a tremendous speedup compared to the naive 
0{L^) algorithm, and the algorithms presented in this pa¬ 
per would certainly not be feasible without the FFT. 

In section 4 we discuss how these generalized FFT algo¬ 
rithms can be used to efficiently compute moments, but 
first we discuss the generalization of the Fourier transform 
on compact Lie groups to the Fourier transform on certain 
manifolds that are not groups. 

3.3. Harmonic analysis on homogeneous spaces 

A homogeneous space for a Lie group G is a manifold H 
such that for any two points y G H wc can find a trans¬ 
formation g ^ G with gx = y. For example, the plane 
is a homogeneous space for the translation group, and the 
sphere is a homogeneous space for the 3D rotation group. 
The plane is not a homogeneous space for the 2D rotation 
group, because points at different radii cannot be rotated 
into each other. 

If we pick an origin o G H, such as the north pole of the 
sphere, we can identify any other point h e H hy spec¬ 
ifying how to transform the origin to get there: h = go. 
This identification will not be unique, though, if there is 
a nontrivial subgroup K of G containing transformations 
that leave the origin invariant: K = {k ^ G\ko = o}. 
This is because if h = go, then also h = gko, so both 
g and gk identify h. On the sphere for example, we can 
transform the north-pole into a point h by first doing an 
arbitrary rotation around the north-pole axis (which leaves 
the north-pole unchanged) and then rotating the result to h. 

Hence, one can think of the points h = go in 3. homoge¬ 
neous space H as sets gK = {gk\k G K} (called cosets) 
of group elements that are equivalent with respect to their 
effect on an arbitrarily chosen origin o of H. It follows 
that one can think of functions on a homogeneous space as 
functions on the group, with the special property that they 
are (right) invariant to transformations from K : 

f{gk)=f{g)Vg€G,k€K, (7) 

because right-multiplication by k will only shuffle the ele¬ 
ments within each coset. Finally, one can show (see sup¬ 
plementary material) that in a suitable basis, a subset of the 
matrix elements of lURs form a basis for the linear space of 
square-integrable functions on G with this invariance prop¬ 
erty, which allows us to define the Fourier transform also 
for functions on H. 

The exact same equations (6 and 5) that define the Fourier 
and inverse Fourier transform for a compact Lie group, de¬ 
fine these transforms for a compact homogeneous space, 
but only a subset of the coefficients will be non-zero. 
For the sphere, the matrix elements T^q equal^ to the 

^Various normalization and phase conventions are in use for 


spherical harmonics which form a basis for 
Fast spherical Fourier transform algorithms were devel¬ 
oped by Driscoll & Healy (1994). 

3.4. Exponential families 

An exponential family is a class of densities of the form: 

P(5l7) = Lexp(7?-T(5)). (8) 

ZjrfJ 

It is determined by a choice of sufficient statistics T, that 
take the random variable g and produce a vector of real 
statistics T{g). 

To learn the parameters p, one can perform gradient-based 
optimization of the log-likelihood of a set of iid samples 
5 ^ 1 , • • •, The gradient is the moment discrepancy: 

V,, =f -Ep^g\^^[T{g)], (9) 

where T = ^ empirical moments. 

There is generally no closed form for the analytical mo¬ 
ments (the expectation in eq. 9), so a numerical approxi¬ 
mation is needed. 

4. Harmonic Exponential Families 

We define a harmonic exponential family on a group or ho¬ 
mogeneous space as an exponential family where the suffi¬ 
cient statistics are given by a finite number of matrix el¬ 
ements of lURs. This makes sense only if the function 
p • T{g) is real-valued, so that it can be interpreted as an 
unnormalized log-probability. The easiest way to guaran¬ 
tee this is to take p to be real, and to use real functions 
obtained as a sparse linear combination of complex ma¬ 
trix elements T{g) as sufficient statistics. For example, 
i(g2A0 ^ e“*^^) = cos{X 0 ). From here on, we take T 
to be real, -normalized functions and T the expansion 
of a real function in terms of real basis functions T. 

The key observation that leads to an efficient algorithm for 
computing the moments of a harmonic density is that the 
moments of such a density are its Fourier coefficients: 

^p{gh) [^( 5 )] = [ P{g\v) T{g) dii{g) = T -p. ( 10 ) 

Jg 

Hence, one can obtain all J moments at once by sampling 
p on a finite grid and then computing its Fourier transform 
using a fast algorithm. As explained in section 4.1, the 
discretization error can be made extremely small using only 
0{J) spatial samples. 

the spherical harmonics, but it is enough to know that oc T^o- 





Harmonic Exponential Families 


However, even evaluating p at a single position takes 0{J) 
computations when using J sufficient statistics so that the 
overall complexity is still Furthermore, in order to 

evaluate p we need to know the normalizing constant . 

The following derivation shows that we can work with the 
unnormalized density p{g\r]) = exp {g • T{g)) instead: 

= f Pial v) T^nia) Md) 

JG 

= ^\ a^{a\r])T^n{a)dii{g) ( 11 ) 

JG 

The last step uses the fact that Too(5') = 1 so that [^(/:?]oo 
is equal to the normalizing constant: 

= / ^ia\ v) Tooia) dfi{g) = ( 12 ) 

JG 

Next, observe that we can evaluate In p efficiently at 0(J) 
spatial points using the inverse FFT: 

\n^{g\7i) = iq-T{g)=[F~^g\{g) ( 13 ) 

This computation is exact, because the log-density is ban- 
dlimited (i.e. there are only finitely many parameters). 
Element-wise exponentiation then gives us p evaluated on 
a grid. 

So we have an efficient algorithm for computing moments: 

1. Compute if = exp (^^“^ 77 ). 

2. Compute M = T p 

3 . Compute [2^(5)] = M/<o. 

To make this computation numerically stable for highly 
peaked densities, one should apply the “log-Fourier-exp” 
trick described in the supplementary material. 

4.1. Approximation quality 

Even though the Fourier coefficients are defined as defi¬ 
nite integrals (eq. 6 ), the discrete FFT algorithms com¬ 
pute exact Fourier coefficients, provided the function from 
which the discrete samples were gathered is handlimited. 
A function is handlimited if the coefficients are zero 
for A greater than the band-limit L. Although the function 
ln(p{g) = g • T{g) is handlimited, the function p{g) = 
exp {g 'T{g)) is not, so the computed coefficients are not 
exactly equal to the Fourier coefficients of cp. 

However, the function (p{g) is smooth (infinitely differen¬ 
tiable), and a standard result in Fourier analysis shows that 


the spectrum of a smooth function decays to zero asymptot¬ 
ically faster than 0(1/A’^) for any n. So our function will 
be “effectively handlimited”, in the sense that coefficients 
for A greater than some pseudo-bandlimit will have negli¬ 
gible values. If L is the maximum degree of the sufficient 
statistics (the bandlimit of 77 • T(^)), one can obtain near- 
exact moments by computing the FFT up to the pseudo- 
bandlimit aL for some oversampling factor a. In practice, 
we use values for a ranging from 2 to 5. 

5 . Harmonic Densities as Conjugate Priors 

In this section we discuss the Bayesian transformation in¬ 
ference problem, where the goal is to infer a posterior over 
a Lie group of transformations given only a set of corre¬ 
spondence pairs (such as images before and after a camera 
motion). It turns out that the harmonic exponential fami¬ 
lies are the conjugate priors for this problem, and again, the 
generalized FFT is key to performing efficient inference. 

The observed data in the Bayesian transformation inference 
problem are pairs of vectors (x, y) in that could rep¬ 
resent images, space-time blocks of video, point-clouds, 
optical-flow fields, fitted geometric primitives, parameters 
of a function or other objects. In order to infer anything 
about a latent transformation g, we must know the group 
representation R{g) that acts on the observed data. If our 
data is an image x : M? ^ R, we get a representation on 
the Hilbert space in which x lives: [R{g)x]{p) = x{g~^p), 
where 79 is a point in the plane. In the finite-dimensional 
analogue, where x is a vector of pixel intensities, R{g) will 
be close to a permutation matrix that takes each pixel to its 
proper new position. For compact groups^ this represen¬ 
tation is unitary, and this is what we will assume for R{g) 
from now on. If the representation is not known in advance, 
it can also be learned from data (Cohen & Welling, 2014). 

If we assume that observation x is the ^-transformed ver¬ 
sion of y with some independent Gaussian noise with vari¬ 
ance added, the likelihood function is given by 

p{x \y,g)= M{x I R{g)y, ct^). (14) 

As discussed in section 3.2, we can bring R in block- 
diagonal form by a unitary change of basis: U{g) = 
F~^R{g)F. The matrix U is block diagonal, and the 
blocks are equal to the -normalized sufficient statis¬ 
tics up to a scale factor: T^{g) = Vdim XU^{g). To 
simplify the computations, we shall work with data in this 
new basis: x = Fx so thatp(f \y, g) = M{x \ U{g)y, cr^). 

If we now choose as prior p{g) sl member of the harmonic 
exponential family on G, the posterior p{g\x, y) is of the 

^This more generally true for unimodular groups. 
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same form as the prior: 
p{g \x,y)(x p(x I y, g)p{g) 

« exp - U{g)yf + y ■ T{g)^ 

^9)y + V ■ T{g)j 

(15) 

i.e. we have a conjugate prior. The derivation relies on the 
unitarity of the representation: in expanding \\x-U{g)y\\\ 
we find a term \\U{g)y\\‘^ which is equal to \\y\\‘^, making 
the dependence on U{g) linear (as it is in the prior). 

5.1. Example: Bayesian analysis of camera rotation 

To make matters concrete, we show how to compute a 
posterior over the rotation group SO(3) given two images 
taken before and after a camera rotation. An image is mod¬ 
eled as a function x : 5'^ ^ M on the sphere, so that a 
camera rotation g G SO(3) acts by rotating this function 
over the sphere: [R{g)x\{p) = x{g~^p). We parameterize 
points p G 5'^ as p = (p, 0) for p G [0, 2ti] and 0 G [0, tt]. 
Recall from section 3.3 that we can represent p G 5'^ by 
a coset representative g^ G SO(3), which we parameterize 
using Euler angles as Pp = (p, 6>,0). The transformation 
Pp takes the origin of the sphere to p. 

It is well known^ that in this context the matrix F — 
defined in the previous section as the matrix that block- 
diagonalizes the representation R — is given by the spher¬ 
ical Fourier transform F, which can be computed by an 
EFT. This means that if we represent x by its Fourier co¬ 
efficients X = Fx,i\\o coefficients of the rotated function 
R{g)x{p) = x{g~^p) are given by U{g)x, where U(g) is a 
block-diagonal matrix with irreducible representations 
as blocks. As derived in the previous section, the parame¬ 
ters of the posterior can easily be obtained in this basis by 
a block-wise outer product. 

Figure 1 shows the posterior p(p, | x, p) for two synthetic 
spherical images x^ and and their rotations y^ = x^ 
(no rotation) and = f/(0,7r/3,7r/2)x^. The posterior is 
plotted as a 3D isocontour in the ZYZ-Euler angle param¬ 
eter space G [0,27r] x [0,7r] x [0,27r]. Although 

this plot looks like a box, the actual manifold has the topol¬ 
ogy of a projective 3-sphere. By construction, the density 
is spread through the parameter space in a way that is con¬ 
sistent with this topology: no discontinuities arise where 
wraparound in the parameter space occurs. This is desir¬ 
able, because the wraparound is not an intrinsic property of 
the manifold. 

"^The derivation can be found in the supplementary material. 



ooco 

Figure 1 . Posterior distributions (top) for two correspondence 
pairs (bottom). 

Due to the symmetry of these figures certain rotations can¬ 
not be distinguished from the “true” rotation, so that the 
modes of the posterior distributions are supported on an 
entire subgroup of SO(3). Real images will not have such 
a high degree of symmetry, but it will nevertheless often be 
the case that a unique optimal transformation does not ex¬ 
ist (Ma et al., 1999). Indeed, current keypoint based trans¬ 
formation estimation methods can easily get confused by 
repeating structures in an image, such as several identical 
windows on a building. Although more experimental work 
is needed, our method has the theoretical advantage that be¬ 
sides keypoints (which form a group representation), it can 
make use of parts of the image that do not allow for key- 
points to be reliably placed (such as edges), while always 
providing a truthful impression of the degree to which a 
unique transformation or subgroup can be identified form 
the data. 

5.2. MAP inference 

As Bayesians we are done here, but for some applications 
one may wish to find a single most likely transformation. 
To find the optimal transformation, first perform posterior 
inference (steps 1 and 2) and then maximize (step 3): 

1. Compute X = Fx and y = Fy. 

2. Compute + -^-^^xxyl 

3. Compute ^* = argmax4^“^^](^i) 

The arg max ranges over all the points in a finite grid on 
G used by the FFT synthesis F~^. Optionally, one can re¬ 
fine the optimum by performing a few steps of gradient- 
based optimization on • T( 5 f) to get sub-pixel accuracy. 
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6. Experiments 

6.1. Modelling the spatial distribution of earthquakes 

We compare our model and MLE algorithm to a Kent Mix¬ 
ture Model (KMM) on the problem of modelling the spatial 
distribution of significant earthquakes on the surface of the 
earth. 

We obtained the Significant Earthquake Dataset (NGDC, 
2015) from the National Geophysical Datacenter of the Na¬ 
tional Oceanographic and Athmospheric Administration. 
In total, the dataset contains 5780 earthquakes with com¬ 
plete information on the position of their epicenter, and 53 
earthquakes whose coordinates are missing (these were dis¬ 
carded in our experiments). We did not model the severity 
of the earthquake, but only the occurrence of significant 
earthquakes (as defined by (NGDC, 2015)). 

6.1.1. Mixture of Kent distributions 

The 5-parameter Kent distribution (Mardia & Jupp, 1999) 
is the spherical analogue of the normal distribution with un¬ 
constrained covariance. Being unimodal, the Kent distribu¬ 
tion is not flexible enough to describe complicated distri¬ 
butions such as the spatial distribution of earthquakes. The 
most fiexible distribution on the sphere that we have found 
in the literature is the Kent Mixture Model, first described 
by Peel et al. (2001). The KMM is trained using the EM 
algorithm. We use the open source Python implementation 
of the EM algorithm for KMMs by Hofer (2014). 

Unlike the harmonic densities, the log-likelihood of this 
model is not convex and contains many singularities where 
a mixture component concentrates on a single data point 
and decreases its variance indefinitely. Eor this reason, 
we perform randomly initialized restarts until the algorithm 
has found 10 non-degenerate solutions, of which we retain 
the one with the best cross-validation log-likelihood. No 
regularization was used, because for the models that could 
be trained within a reasonable amount of time, no overfit¬ 
ting was observed. 

6.1.2. The S ‘^ harmonic density 

The harmonic density on the 2-sphere uses spherical har¬ 
monics as sufficient statistics. The empirical moments are 
easily computed using standard spherical harmonic rou¬ 
tines, but we found that for high orders the SciPy rou¬ 
tines are slow and numerically unstable. The supplemen¬ 
tary material describes a simple, fast, and stable method 
for the evaluation of spherical harmonics. The computa¬ 
tion of spherical harmonics up to band-limit L = 200 (for 
a total of {L -f 1)^ = 40401 spherical harmonics) for 5780 
points on the sphere took half a minute using this method 
and is performed only once for a given dataset. 


Eor regularization we use a diagonal Gaussian prior on r], 
where the precision corresponding to the coefficient of 
is given by = a dim A = a{2X + 1 ) (for some 
regularization parameter a). This scheme is inspired by 
the fact that dim A is the discrete Plancherel measure (Sug- 
iura, 1990), and the empirical observation that the fitted co¬ 
efficients become approximately uniform when weighted 
as 77 ^ Vdim A. Note that adding regularization does not 
change the convexity of the objective function. 

To find maximum a posteriori parameters fj for the spher¬ 
ical harmonic density, we perform iterative gradient-based 
optimization on the log-posterior. The gradients (moment 
discrepancies) are computed using the EET-based method 
described in section 4. We use the spherical EET algorithm 
implemented in the NEET library (Keiner et al., 2009; Ku- 
nis & Potts, 2003). The gradients are fed to a standard im¬ 
plementation of the L-BEGS algorithm. 

6.1.3. Results 

Eigure 2 shows the average train and test log-likelihood 
over 5 cross-validation folds, for the spherical harmonic 
density and the mixture of Kent distribution. The plotted 
values correspond to the regularization settings that yielded 
the best test log-likelihood. 

The KMM reached an average test log-likelihood of —0.37 
(with standard deviation of 0.03 over 5 cross-validation 
folds) using 70 mixture components (5 x 70 + 69 = 419 
parameters). The harmonic density reached an average test 
log-likelihood of +0.3 (with standard deviation of 0.036 
over 5 cross-validation folds), using bandlimit L = 140 
(19880 parameters). The HD still outperforms the KMM 
when given a similar number of parameters: for L = 20 
(440 « 419 parameters), the log-likelihood is —0.28 > 
—0.37, with standard deviation 0.028, and for L = 19 (399 
parameters), the log-likelihood is —0.3 with standard devi¬ 
ation 0.03. The dataset and the learned densities are plotted 
in figure 3, clearly showing the superiority of the harmonic 
density. 

7. Discussion and Future Work 

What could explain the difference in log-likelihood be¬ 
tween our model and the KMM? We believe two inter¬ 
related factors are driving this difference: the expressive¬ 
ness of the model and the ease of optimization. Leaving 
technicalities aside, it is clear that both the spherical har¬ 
monic exponential family and the KMM can approximate 
any well-behaved density, given enough parameters and an 
optimization oracle. However, as is clear from figure 2, 
training time becomes prohibitive for the KMM for more 
than 70 mixture components / 419 parameters, while the 
harmonic density can efficiently be fit for tens of thousands 
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Figure 2. Top: cross-validation train (dashed line) and test (solid 
line) log-likelihood for the Harmonic Density (HD) and Kent 
Mixture Model (KMM). Bottom: number of parameters versus 
training time for both models. 


of parameters. 

Furthermore, the KMM training algorithm (EM) can easily 
get stuck in local optima, or converge on a degenerate so¬ 
lution. This is the main reason for the poor runtime perfor¬ 
mance; while the KMM code could be further optimized, it 
is the fact that so many restarts are required to find a good 
fit that makes the algorithm slow. The log-likelihood func¬ 
tion of the harmonic density, on the other hand, is convex, 
and the L-BFGS optimizer will typically converge to the 
global optimum in some 20 — 100 iterations. 

As can be seen in figure 3, the harmonic density produces 
slight ringing artifacts that can be seen only in a log-plot 
such as this. These are the result of the limited bandwidth 
of the log-density, and will become progressively less pro¬ 
nounced as the number of parameters is increased. While 
they are clearly visible in log-space, the actual difference 
between peaks and valleys is on the order of 10“^ for ban- 
dlimit L = 100. The artifacts are not visible on a non- 
logarithmic plot (and in such a plot the KMM density is 
hardly visible at all when plotted on the same intensity 
scale as the harmonic density, because the peaks are much 
lower). The harmonic density also tends to prefer heavier 
tails, which is probably accurate for many problems. 

An interesting direction for future work is the extension 
to non-compact groups. While the mathematical theory 
becomes much more technical for such groups, (Kyatkin 
& Chirikjian, 2000) have already succeeded in developing 
FFT algorithms for the Euclidean motion group which is 
non-compact. From there it should be relatively straight¬ 
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Figure 3. Log-probability for harmonic density and Kent mixture 
model, plotted using a perceptually accurate linear-lightness col- 
ormap on the same intensity scale. 


forward to develop harmonic densities on the Euclidean 
group. Harmonic densities on the Euclidean, affine or even 
projective group should find many applications in robotics 
and computer vision (see e.g. (Kyatkin & Chirikjian, 
1999)). 

8. Conclusion 

We have studied a class of exponential families on compact 
Lie groups and homogeneous manifolds, which we call har¬ 
monic exponential families. We have shown that for these 
families, maximum likelihood inference, posterior infer¬ 
ence and mode seeking can be implemented using very effi¬ 
cient generalized East Eourier Transform algorithms. In the 
Bayesian setting, we have shown that harmonic exponential 
families appear naturally as conjugate priors in the generic 
transformation inference problem. Our experiments show 
that training harmonic densities is fast even for very large 
numbers of parameters, and that far superior likelihood can 
be achieved using these models. 
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